Classification of Semantic Concepts to Support the Analysis of the Inter-cultural Visual Repertoires of TV News Reviews

Authors

  • Martin Stommel
  • Martina Duemcke
  • Otthein Herzog
Abstract

TV news reviews are of strong interest in media and communication sciences, since they indicate national and international social trends. To identify such trends, scientists from these disciplines usually work with manually annotated video data. In this paper, we investigate whether the time-consuming process of manual annotation can be automated using current pattern recognition techniques. To this end, we conduct a comparative study of different combinations of local and global feature sets with two variants of the pyramid match kernel and measure the performance of the classification of TV news scenes. The classes are taken from a coding scheme that is the result of an international discourse in media and communication sciences. For the classification of studio vs. non-studio, football vs. ice hockey, computer graphics vs. natural scenes and crowd vs. no crowd, recognition rates between 80 and 90 per cent were achieved.

1 Analysis of Visual Repertoires in Media and Communication Sciences

The development of our society as documented in TV news reports is a subject of research in media and communication sciences. While the content of a news report itself is of high importance, media and communication scientists are aware of more subtle but equally crucial sources of information. The structure of the scene setup may, for example, suggest a certain social role of the actors. The meaning of a scene depends not only on the video data but also on the cultural background of the viewer. Often it is more conclusive to identify the issues that have been omitted than those actually addressed. TV news is well suited to studying such questions. The constant process of production, repetition and summarisation of TV news and news reviews results in video representations of the most relevant events of our society in very concise form [2].
The symbolic value and the wide distribution of these representations make them interesting for comparison across countries or years.

Footnote 1: A short version of this article has been published at the KI 2011 conference [1].

The analysis usually includes a lot of manual video annotation. Research efforts in different countries resulted in a coding sheet that states the most important items for annotation [3]. Additional items are included to handle specific research questions. To reduce the influence of personal background and understanding, the annotation is conducted by specialists who have been trained to achieve a high inter-coder reliability, i.e. a high agreement in the annotations. The inter-coder reliability, measured as Krippendorff's alpha, exceeds 70 per cent under good conditions. The annotation is used to compare the depictions of people and events over different countries or years.

In this paper, we study whether the process can be facilitated by using current pattern recognition techniques. To this end, we chose four items with low symbolic connotation from the annotation scheme. The items are studio/non-studio, football/ice hockey, computer graphics/natural scenes and crowd/no crowd. The pyramid match kernel is trained to classify these items based on a set of local and global detectors and descriptors. Using the optimal feature configurations, we achieve excellent recognition rates for all classes.

2 Computational Approaches

Computational approaches consist of preprocessing, feature extraction and classification steps [4]. For some industrial computer vision applications this may be a straightforward processing chain. The classification of TV material with its contextual cross references and rich semantics requires a more complex procedure in multiple stages. The idea of a multi-stage or hierarchical procedure can already be found in earlier connectionist approaches [5].
The approaches are justified biologically [6], psychologically [7] or statistically [8]. The structure and understanding of the hierarchy is application dependent. For the case of TV material, Dorai and Venkatesh [9] distinguish between a high and a low level in their theoretical framework. The high level deals with the narrative form and the arrangement of scenes and effects by the filmmaker. Low-level features, on the other hand, are characterised as rather formal properties that can be extracted from single frames or shots.

Practical efforts to reach the high level are connected to the notion of a semantic concept [10]. On an intermediate level, semantic concepts are named objects or scene types. The name distinguishes them from strictly syntactical low-level features. Finer, sometimes recursive subdivisions of objects into their parts have been proposed (e.g. [11, 6, 12, 7]). Hauptmann et al. [13] extrapolate from measurements on 300 TRECVid concepts and conclude that a few thousand concepts with moderate recognition accuracy might be sufficient to reliably retrieve news videos.

While low-level syntactical features do not allow for a reliable scene classification [13], they achieve a certain invariance against illumination and deformation. The influence of illumination and pose on the object appearance has been visualised by Murase and Nayar [14], allowing them to model the appearance directly by using principal component analysis. Garg et al. [15] provide theoretical and practical results showing that the dimensionality of scene appearances under natural conditions can be reduced to 10 to 30 dimensions without visible loss using principal component analysis. In most cases the scene appearance is not modelled directly. Instead, semantic concepts are usually represented by sets of local feature vectors [16–19] trained by machine learning algorithms [20].
A popular approach is to subdivide the feature space into bins that can be used to compute histograms over the feature space or to span simplified new feature spaces [21–23, 17]. The subdivision can be general purpose or optimised to a particular semantic concept [24]. To a certain degree, the trained feature sets resemble the alphabet of moderately complex features found by Tanaka [25] in the inferior temporal cortex. Because geometrical dependencies often cause high computational costs, these approaches usually follow the bag-of-features principle, which disregards the spatial arrangement of the features. However, experiments on different types of constellation models indicate advantages for the use of geometry [26] depending on the level of abstraction [12]. Some studies therefore aim at incorporating geometrical information [27–30]. Yang et al. [31] propose a scene classification based on motion features. Recent results indicate that the time-consuming clustering of local features can be simplified by creating a random alphabet of visual words, given a sufficient size of the alphabet [32, 33] and a proper pooling function [11].

3 Experimental Setup

In our analysis we evaluate two versions of Grauman and Darrell's Pyramid Match Kernel [34, 35] in combination with four interest point detectors, four feature descriptors, and three global features. The Pyramid Match Kernel compares histograms of the input data based on the simultaneous histogram intersection at multiple bin widths. The kernel function is then used with a Support Vector Machine. While the original version uses bins that are aligned to regular grids, a later version [35] performs a hierarchical clustering to align the bins to the distribution of the data. The Pyramid Match Kernel is used to classify local and global image features both separately and in combination. Feature combinations are represented by concatenating their descriptors.
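
The histogram intersection at multiple bin widths can be illustrated with a minimal Python sketch of the original, grid-aligned variant. It assumes that each image is given as a set of feature points with non-negative coordinates, that the bin side doubles from level to level (1, 2, 4, ...), and that matches first found at level i are weighted by 1/2^i. The function names are hypothetical. The random grid shifts, the normalisation of the score, the data-dependent bins of the later version [35] and the Support Vector Machine are all left out.

```python
from collections import Counter


def level_histogram(points, side):
    """Count the points that fall into each grid cell of the given side length."""
    return Counter(tuple(int(c // side) for c in p) for p in points)


def intersection(hist_a, hist_b):
    """Histogram intersection: number of points that can be paired within cells."""
    return sum(min(count, hist_b[cell]) for cell, count in hist_a.items())


def pyramid_match(points_a, points_b, levels):
    """Unnormalised pyramid match score over bin sides 1, 2, 4, ... (assumed scheme)."""
    score = 0.0
    previous = 0  # matches already found at finer levels
    for i in range(levels):
        side = 2 ** i
        matched = intersection(level_histogram(points_a, side),
                               level_histogram(points_b, side))
        # Only matches that are new at this level count, weighted by 1 / 2^i.
        score += (matched - previous) / side
        previous = matched
    return score
```

Two identical sets of n points score n, since every point is already matched at the finest level. Points that only share a bin at a coarser level contribute correspondingly less.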
Local features are computed at interest points detected by Speeded Up Robust Features (SURF) [36], Maximally Stable Extremal Regions (MSER) [37], and the Harris-Affine and Hessian-Affine detectors [38]. These local detectors are combined with four feature descriptors: the SURF descriptor, the location of a feature point (i.e. the image coordinate), Steerable Filters [39] and Shape Context [40].

As global features we use colour histograms in two versions. Global colour histograms are built by concatenating the intensity histograms of the three colour channels. Local colour histograms are the concatenation of all colour histograms computed in the cells of a regular grid with a spacing of 16 pixels placed over the image. The presence or absence of faces is used as a third global feature [41]. The aim of this setup is to benefit from complementary information, e.g. colour and texture.

Fig. 1. Three samples from the studio (on the left) and non-studio class (on the right).

The classification is conducted on single, representatively chosen frames. Every frame stands for a shot in a TV news review and is annotated by the binary categories studio vs. non-studio, football vs. ice hockey, computer graphics vs. natural scenes and crowd vs. no crowd. The sample comprises 200 frames each for studio and non-studio. The images are taken from 400 shots of ABC and CBS TV news reviews from 1999, 2001, and 2003–2009. For the categories football and ice hockey, 50 frames each are chosen from ARD and ZDF news reviews from 2008–2010. The categories computer graphics and natural scenes are represented by 50 frames each from ABC and CBS news reviews from 1999–2000, 2003–2006 and 2008. The recognition of crowds is tested with 40 positive and 40 negative samples from ABC and CBS news reviews from 1999, 2001, 2005 and 2008. The images are randomly split into equally sized training and test samples.
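
A split of this kind can be sketched in Python as follows. The sketch assumes that every frame carries an identifier of the video it was taken from; the pair layout and the function name are hypothetical and not taken from the paper. It assigns whole videos to one side, so that frames of the same video never appear in both sets, and always fills the currently smaller side first, which yields equal halves only if the videos contribute similar numbers of frames.

```python
import random
from collections import defaultdict


def split_by_video(frames, seed=0):
    """Split (frame_id, video_id) pairs into train/test lists of frame ids.

    All frames of one video end up on the same side of the split.
    """
    by_video = defaultdict(list)
    for frame_id, video_id in frames:
        by_video[video_id].append(frame_id)

    videos = sorted(by_video)              # fixed order before shuffling
    random.Random(seed).shuffle(videos)    # reproducible random order

    train, test = [], []
    for video in videos:
        # Give the next video to whichever side currently holds fewer frames.
        target = train if len(train) <= len(test) else test
        target.extend(by_video[video])
    return train, test
```

With four videos of three frames each, for example, both sides receive two complete videos and six frames.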
Special care is taken that no frames of the same video are present in the training and test set at the same time. This is to exclude spurious matches between related shots of a longer scene. Figures 1, 2, 3 and 4 show three samples for each class.

Fig. 2. Three samples from the football (on the left) and ice hockey class (on the right).

4 Experimental Results

Figure 5 shows the accuracy of the classification of studio scenes using the original Pyramid Match Kernel. Comparatively high results of up to 77 per cent are obtained for the SURF descriptor in combination with MSER or one of the corner detectors. Texture and edges therefore seem more important for the studio class than colour. Faces also appear to be a good feature, and it seems that the classifier recognises studio frames by the anchor person. However, most combinations yield recognition rates only slightly better than random.

The hierarchical clustering introduced later [35] leads to a significant improvement for almost all feature types. Figure 6 shows the accuracy. Experiments on the number and depth of the branches of the cluster hierarchy show that a proper alignment of the Match Kernel to the data distribution is indeed crucial. Our results therefore validate the observations by Grauman and Darrell [35]. With the better alignment, the best results are now obtained for feature configurations including the shape context. In the following, all results are obtained using the hierarchical clustering in the preprocessing.

Fig. 3. Three samples from the computer graphics (on the left) and natural scene class (on the right).

As Fig. 7 shows, the combination of multiple detectors increases the accuracy to more than 81 per cent. However, the increase in accuracy is offset by the computational cost of handling a higher number of interest points. The figure also shows that the combination of multiple descriptors instead of multiple detectors decreases the accuracy.
The result shows that the trade-off between the fusion of complementary information and numerical stability is still a non-trivial problem. This is also in accordance with observations by Hauptmann et al. [13] on the combination of semantic concepts.

The classification of the sport type can be handled very well by the experimental setup. The best feature combination reaches an accuracy of 98 per cent (see fig. 8). The highly dynamic scenes are handled best by the SURF detector and descriptor, while the feature location proves inappropriate here. The predominance of either white or green background (see fig. 2) is reflected in the good results for the colour histograms. The frequent occurrence of the audience at the top margin of the images might explain the advantage of the local colour histograms.

The good contrast of the computer-generated TV news shots seems to match the MSER detector combinations best, with an accuracy of 72 per cent on average (see fig. 9). The high performance of the location descriptor, with a best accuracy of up to 77 per cent in combination with the Harris-Affine interest operator, can be explained by the static nature of the video type. Computer animations are also frequently repeated without significant change since they form a distinguishing feature of a TV news show.

Fig. 4. Three samples from the crowd (on the left) and non-crowd class (on the right).

The accuracy for the recognition of crowds is shown in fig. 10. The results are good for most local features including local colour histograms. A maximum of more than 89 per cent is reached for the SURF descriptor combined with either the SURF or Hessian-Affine interest point detector. The clear advantage over the results for the global colour histogram indicates that geometry is a crucial feature for this class. The face detector performs poorly in the recognition of a crowd. Although many faces are present, faces are often occluded or too small to be detected.
Also, the skin colour analysis might be disturbed by poorly illuminated faces and faces that blend into the background.


Similar resources

Thematic analysis of the news of the 2020 Tokyo Olympics with emphasis on gender(case study: Shargh news paper)

  abstract:   The purpose of writing this article is to thematically analyze the news of the 2020 Tokyo Olympics by emphasizing gender and presenting an indigenous model of its related components using the theories of experts. The text of the Tokyo 2020 Olympic event is in Shargh 1400 newspaper (August 1 - August 17) which is a purposeful sampling, first based on commonalities, related them...


The Ideology of Iranian National Television in News Presentation: A Critical Discourse Analysis Perspective

Media in general and news in particular constitute indispensable parts of modern life. TV news which contains visual elements in addition to the verbal aspects was used as the corpus of this study. In fact, this research culled out the ideologies of Iranian national television in presenting different issues. To be more precise, three internationally-significant pieces of news reports were chose...


‘Representational Repertoires’ of Neoliberal Ideologies in Interchange (Third Edition) Series

Considering the fact that engagement with political economy is central to any fully rounded analysis of language and language-related issues in the neoliberal-stricken world today, and that applied linguistics has ignored the role of political economy (Block, Gray, & Holborow, 2012),  for the first time, this study investigated the representations of neoliberal ideologies in the Interchange Thi...


Studying the Prominence-based Situation in Iranian Television News from a Persuasion Perspective

This article is part of a much broader study of persuasive techniques in television news in Iran. As a very important part of IRIB news IRINN has been chosen for this study. If we assume that one of the most influential television news, the power of persuasion to influence the audience is very important. It seemed that choosing the highest hypothetical level of news and deconstructing it would ...


The need to create a media block for the convergence of overseas news networks

As a general diplomacy arm of the Islamic Republic of Iran, VoSiMa has extensive activities in international broadcasting of its radio and television programs. These programs are broadcast in different languages, such as English, French, Azeri, Arabic, and ... for regional and transnational audiences. The large volume of the organization's international activities is in the form of news and new...




Publication date: 2011